This report was prepared for ETC5521 Assignment 1 by Team numbat, comprising Aarathy Babu, Lachlan Moody, Dilinie Seimon, and Jinhao Luo.
2020 was a bad year for passwords. A recent audit of the ‘dark web’, reported on by Forbes, revealed that over 15 billion stolen logins were circulating online (Winder, 2020). As stated in the article, for perspective, this represents two sets of account logins for every person on the planet.
This was the result of more than 100,000 data breaches relating to cybercrime activities, a 300% increase since 2018. So, in an age where everybody is leaving an ever-growing digital record of their activities, from social media to banking, what can the average person do to bolster their security online?
The following analysis explores this issue in depth using a compilation of some of the most commonly used passwords on the web. It should be noted, however, that the original data was compiled in September 2014. The trends and findings discussed below may therefore not be entirely applicable to the present day, and a more up-to-date collection would be required to ensure full relevance. It is nevertheless reasonable to assume that the underlying foundations of password security have not changed substantially in the past few years. Additionally, the strength rating provided is calculated relative to the other passwords in the data set. As laid out in the provided documentation, because these common passwords are almost all ‘bad’, a high strength rating does not necessarily indicate that a password is hard to crack. The data set does, however, contain additional variables (estimated cracking times) that allow this to be assessed. Detailed information on the data used and the research questions formulated is provided in the following section.
Based upon the motivations discussed above, the following research questions were formulated. The primary subject of interest was:
What are the characteristics of the most common passwords in the interest of security?
Once this exploration area was established, three questions were composed to frame the subsequent analysis. They were:
To address these areas and explore the field in greater depth, data was sourced from the book Information is Beautiful (2014). This contained information on 507 passwords derived from the online databases Skullsecurity and DigiNinja, collected in 2014. The data was provided in a tidy format and was read into RStudio as a CSV file directly from the GitHub repository provided by Tidy Tuesday (2020) using the readr (2018) package. The data contained the following variables:
A visualisation of the data structure, produced using the visdat package (2017), can be seen below in Figure 2.1.
Figure 2.1: Initial Data Structure
Figure 2.2: Missing Data Values
On further investigation, there appeared to be 7 blank rows at the end of the dataset. These observations were removed using dplyr (2020), as they provided no tangible value and may have negatively impacted the subsequent analysis. The resulting data frame had 500 observations of 9 variables.
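The row-removal step can be sketched as follows. This is a minimal illustration in base R rather than the dplyr call used in the report, and it assumes that a "blank" row is one in which every column is missing. The data frame `passwords_raw` and the rank given to ‘123456’ are placeholders made up for the example, not values read from the data.

```r
# Illustrative data only: two passwords named in the report plus two
# all-missing rows standing in for the blank rows at the end of the file.
# The rank assigned to "123456" is a placeholder.
passwords_raw <- data.frame(
  rank     = c(1, 2, NA, NA),
  password = c("password", "123456", NA, NA),
  category = c("password-related", "simple-alphanumeric", NA, NA),
  stringsAsFactors = FALSE
)

# Assumption: a row counts as blank when every one of its columns is NA.
is_blank <- rowSums(is.na(passwords_raw)) == ncol(passwords_raw)

# Keep only the rows that carry at least one value.
passwords <- passwords_raw[!is_blank, ]
nrow(passwords)
```

Testing for rows that are missing in every column, rather than dropping any row with a missing value, avoids discarding observations that are only partially incomplete.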
To address the first research question that was proposed, ‘What are the common trends among the most commonly used passwords?’ the following analysis was undertaken.
Firstly, the table in Figure 3.1 below provides a brief overview of the data that has been collected, containing the variables relating to the password used, its rank in the data set, the associated category and its relative strength. Simply looking at this table shows that many of the top-ranked passwords in terms of popularity are quite simple in nature, containing ordered numerical or alphabetical series and simple words such as ‘baseball’ and ‘football’. Surprisingly, the word ‘password’ itself holds the number one spot on the list. This table was made interactive using the DT (2020) package, allowing users to uncover which passwords made the top 500 and to see whether their own password is on the list.
Figure 3.1: Top 500 Most Popular Passwords
Following this, a visual representation of the data was required to see if any further patterns could be uncovered that did not present themselves in the above tabular format. A wordcloud was chosen for this purpose due to its ability to quickly convey information about textual data. As these graphics are typically produced using the frequency with which a word is used, a new variable was required as a proxy for frequency in this dataset. To achieve this, the data was ordered by inverse rank and a row number was then assigned, such that the password with rank equal to 1 would have an associated row number of 500. The data was then restricted to the top 50 entries and coloured by category. The resulting plot is displayed in Figure 3.2 and was produced using the wordcloud package (2018).
Figure 3.2: 50 most popular passwords
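The rank-to-frequency proxy described for the wordcloud can be sketched in base R as below. The three-row data frame `toy`, the column name `freq_proxy` and the ranks given to ‘baseball’ and ‘football’ are assumptions made for illustration. Only the idea (order by inverse rank, then number the rows) comes from the text.

```r
# Illustrative data: 'password' is rank 1 as stated in the report;
# the other two ranks are placeholders.
toy <- data.frame(
  password = c("football", "password", "baseball"),
  rank     = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Order from least to most popular (inverse rank), then number the rows so
# that the most popular password receives the largest pseudo-frequency.
toy <- toy[order(-toy$rank), ]
toy$freq_proxy <- seq_len(nrow(toy))

# Restrict to the most popular entries (the report keeps 50; 2 is used here
# only because the toy data has three rows).
n_keep <- 2
top <- toy[toy$rank <= n_keep, ]
top
```

With 500 rows, the same steps would give the rank 1 password a proxy value of 500, which a wordcloud function can then treat as a word frequency.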
From the above plot, a trend appears to emerge, with many of the top-ranked passwords belonging to a small subset of categories, as evidenced by the predominance of the brown, purple and blue colours in the wordcloud. This suggested that password popularity may be related to password category.
To explore this relationship more closely, the data was expanded from the previous top 50 restriction to include the entire dataset.
The data set was broken into 10 groups corresponding to the 10 provided categories, and a tally was taken of the number of passwords in each group. These figures were then divided by the total number of passwords to obtain each category's share of the data. This is plotted in the interactive Figure 3.3 below using plotly (2020).
Figure 3.3: Most popular password categories
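The tally-and-divide calculation behind the category shares can be sketched in base R as follows. The `category` vector is a made-up six-element example, so the printed percentages are not the report's figures.

```r
# Illustrative category labels only (not the real 500 observations).
category <- c("name", "name", "sport", "name",
              "simple-alphanumeric", "sport")

# Tally the passwords in each category ...
counts <- table(category)

# ... and divide by the total to obtain each category's share.
category_share <- prop.table(counts)

# Express as percentages, largest first.
sort(round(100 * category_share, 1), decreasing = TRUE)
```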
The trend witnessed in the wordcloud appeared to carry over to the entire data set, with 65% of the top 500 passwords belonging to only three of the categories: ‘name’, ‘cool-macho’ and ‘simple-alphanumeric’. In particular, ‘name’ dominated the list, comprising over a third of the passwords recorded. This supports the observation made from the data table that people may prefer simple passwords that are easy to remember. It may also indicate an area of vulnerability, as a name, an easily identifiable piece of information, may be used to protect potentially sensitive information.
Another area of interest in relation to this question was password length. While not originally included as a variable in the data it is easily calculated by counting the number of characters in the ‘password’ variable.
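As a small sketch of this derived variable, the character count can be obtained with base R's `nchar()`. The passwords below are ones quoted elsewhere in the report, and the vector name `password_length` is an assumption.

```r
# Passwords quoted in the report, used here purely as example input.
password <- c("password", "123456", "12345678", "1234", "12345", "123456789")

# Count the characters in each password.
password_length <- nchar(password)

# Frequency of each length, the basis of a length distribution plot.
table(password_length)
```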
Once calculated, a frequency distribution was produced of the resulting variable. To better compare the length distribution across categories, a percent stacked bar chart was also plotted. The columns have been coloured relative to category and are the same across the two plots. Figure 3.4 was also made interactive using plotly (2020).
Figure 3.4: Most popular password lengths
Considering the frequency distribution first, the data are approximately normally distributed with a slight left skew. There is, however, a clear peak in the middle of the distribution, indicating that the majority of passwords in the data set are 6 characters long or more. This may be due to password requirements on many websites and programs that enforce a minimum number of characters, rather than personal preference. The longest password recorded, at 9 characters, was the only one of its length, supporting the earlier observation that people may prefer simpler passwords.
While the first plot gives some indication of the distribution of categories across lengths, the second plot allows for more direct comparisons. First of all, the ‘name’ category appears to be the most popular across all lengths except the shortest and longest groups. This is reasonable, as it is less common for names to contain very few or very many characters, whereas 4-digit passwords lend themselves to a birth date or a year. A second point of interest is that the longest password in the data set is of the ‘simple-alphanumeric’ category. Upon further investigation, this was found to be the string ‘123456789’. This is further evidence that most passwords are characteristically simple and easy to remember, which is reasonable considering they need to be recalled on a regular basis. It does, however, indicate a potential vulnerability in general password security.
The in-depth summary in Table 3.1 below, produced using kableExtra (2019), provides some additional insight into the data.
Overall, passwords are 6.2 characters long on average, consistent with the concentration around 6 characters observed earlier, with the interquartile range spanning 6 to 7 characters. The longest passwords tend to be in the ‘nerdy-pop’ category (mean 6.63), while the shortest are in the ‘fluffy’ category (mean 5.80). Interestingly, although the longest password recorded was of type ‘simple-alphanumeric’, that category has the second-smallest mean length (5.93).
| Password Category | Minimum | Q1 | Median | Mean | Q3 | Maximum |
|---|---|---|---|---|---|---|
| Overall | 4 | 6 | 6 | 6.20 | 7 | 9 |
| animal | 4 | 6 | 6 | 6.21 | 7 | 8 |
| cool-macho | 4 | 6 | 6 | 6.25 | 7 | 8 |
| fluffy | 4 | 5 | 6 | 5.80 | 6 | 8 |
| food | 5 | 6 | 6 | 6.09 | 6 | 8 |
| name | 4 | 6 | 6 | 6.22 | 7 | 8 |
| nerdy-pop | 5 | 6 | 7 | 6.63 | 7 | 8 |
| password-related | 4 | 6 | 6 | 6.33 | 7 | 8 |
| rebellious-rude | 5 | 6 | 6 | 6.36 | 6 | 8 |
| simple-alphanumeric | 4 | 5 | 6 | 5.93 | 6 | 9 |
| sport | 4 | 6 | 6 | 6.51 | 7 | 8 |
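A summary of this shape (an overall row followed by one row per category) could be computed along the following lines. This is a base R sketch on a six-row toy data frame, so the numbers it prints differ from the table above. The helper `length_stats()` is a hypothetical name, and the categories assigned to ‘password’, ‘baseball’ and ‘football’ are assumptions.

```r
# Toy data: passwords quoted in the report with assumed category labels.
toy <- data.frame(
  password = c("password", "123456", "1234", "123456789",
               "baseball", "football"),
  category = c("password-related", "simple-alphanumeric",
               "simple-alphanumeric", "simple-alphanumeric",
               "sport", "sport"),
  stringsAsFactors = FALSE
)
toy$length <- nchar(toy$password)

# Hypothetical helper returning the six statistics shown in the table.
length_stats <- function(x) {
  c(Minimum = min(x),
    Q1      = quantile(x, 0.25, names = FALSE),
    Median  = median(x),
    Mean    = mean(x),
    Q3      = quantile(x, 0.75, names = FALSE),
    Maximum = max(x))
}

# One row per category, then the overall row stacked on top.
by_category <- t(sapply(split(toy$length, toy$category), length_stats))
length_summary <- rbind(Overall = length_stats(toy$length), by_category)
round(length_summary, 2)
```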
Returning to the research question posed at the beginning, several statements can be made about common trends among popular passwords. Firstly, the passwords chosen are generally simple and easy to remember. Secondly, most passwords fall into the ‘name’, ‘cool-macho’ or ‘simple-alphanumeric’ categories. Finally, most passwords tend to be 6 or 7 characters long. All of these aspects of commonly used passwords have the potential to affect overall password security. The following two analysis sections therefore focus on this topic in greater depth.
The strength of these common passwords is an interesting feature to explore, as the strength variable is relative to the other passwords in the dataset. Since these are commonly used passwords, they are expected to be weak and easy to crack. The following analysis explores the dataset to determine how strong the passwords are. Throughout the analysis, the variable offline_crack_sec (the time taken to crack the password by offline guessing) is used instead of the variable value (the time taken to crack the password by online guessing). The two are proportional to each other, so comparisons between passwords give the same results with either.
Figure 3.5: 43.6 % of the passwords are relatively high in strength
In Figure 3.5 above, it can be seen that about 43.6% of the commonly used passwords have a relative strength between 8 and 10 on a scale of 1-10, with 10 being the highest quality among these passwords. A further 35.4% of the passwords fall in the medium category, with relative strength between 6 and 8, whereas 9.2% have a weak strength of 4-6. Very weak passwords, with strength 0-4, constitute around 8.8% of the passwords. Around 3% of the top 500 common passwords have a strength above 10, an interesting outlier group because it falls outside the 1-10 strength scale set in the dataset description. Since these values differ greatly from typical password strength, passwords with strength above 10 are not included in the data analysis.
Another important characteristic by which to judge a password is the time taken to crack it. To analyse this for the most popular passwords, the 10 most common passwords are selected using the rank variable. As seen in Figure 3.6, passwords like ‘1234’, ‘12345’, ‘123456’ and ‘12345678’ are cracked very easily, taking approximately 0 seconds. Given the argument that popular passwords are predictable and therefore easily cracked, it is interesting to see that ‘password’, despite being ranked first in popularity, is among passwords like ‘football’ and ‘baseball’ that take relatively more time to crack.
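The banding of the strength variable can be sketched with base R's `cut()`. The report does not state which band a boundary value such as 8 belongs to, so the right-closed intervals used here are an assumption, as are the band labels and the illustrative `strength` values.

```r
# Illustrative strength values, not taken from the data.
strength <- c(2, 5, 7, 8, 9, 10, 25)

# Assumption: bands are right-closed, i.e. [0,4], (4,6], (6,8], (8,10], (10,Inf).
band <- cut(
  strength,
  breaks = c(0, 4, 6, 8, 10, Inf),
  labels = c("Very weak", "Weak", "Medium", "High", "Above scale"),
  include.lowest = TRUE,
  right = TRUE
)

# Percentage of passwords in each band.
round(100 * prop.table(table(band)), 1)

# Drop the values that exceed the documented 1-10 scale before further analysis.
strength_kept <- strength[band != "Above scale"]
```

If boundary values should instead fall into the higher band, setting `right = FALSE` would switch to left-closed intervals.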
Figure 3.6: As expected, 1234 is quick to be cracked
The passwords in the dataset belong to 10 different categories, such as ‘simple-alphanumeric’ and ‘animal’. To find which types of passwords are the strongest, the analysis focuses on password strength and the time taken to crack the passwords. To show the distribution of strength within each category, a density plot is drawn below in Figure 3.7 using the ggridges (2020) package. A median line is drawn to allow strength to be compared across categories. It can be seen from the plot that the ‘name’, ‘sport’, ‘cool-macho’ and ‘nerdy-pop’ categories are much higher in strength than the other categories, as 50% of the passwords in each have a strength higher than 8.
Figure 3.7: Password categories like ‘simple-alphanumeric’ have low strength compared to other categories
As further evidence of which types of passwords are among the strongest, the time taken to crack the passwords is also evaluated. To do so, the mean of the variable ‘offline_crack_sec’ is plotted for each category in Figure 3.8 below. The figure shows an interesting pattern: ‘rebellious-rude’ passwords take the longest time to crack on average, even though the median strength of these passwords is lower than that of types like ‘nerdy-pop’ and ‘sport’. A similar pattern is seen for the ‘password-related’ type. Conversely, for types like ‘fluffy’, the average time to crack a password is quite low even though strength is high.
Figure 3.8: Password categories like ‘simple-alphanumeric’, ‘fluffy’ and ‘food’ are among the weaker categories
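The per-category comparison of median strength against mean offline cracking time can be sketched in base R as below. All numbers in `toy` are invented to illustrate the calculation and do not reproduce the report's results. Only the column name `offline_crack_sec` and the category labels come from the text.

```r
# Invented values for illustration only.
toy <- data.frame(
  category          = c("sport", "sport", "fluffy", "fluffy", "rebellious-rude"),
  strength          = c(8, 9, 8, 7, 6),
  offline_crack_sec = c(2.0, 4.0, 0.5, 1.5, 10.0),
  stringsAsFactors  = FALSE
)

# Median strength and mean offline cracking time per category.
median_strength <- tapply(toy$strength, toy$category, median)
mean_crack_sec  <- tapply(toy$offline_crack_sec, toy$category, mean)

# Both results are ordered by category name, so they can be combined directly.
comparison <- data.frame(
  category        = names(mean_crack_sec),
  median_strength = as.vector(median_strength),
  mean_crack_sec  = as.vector(mean_crack_sec)
)

# Slowest-to-crack categories first.
comparison[order(-comparison$mean_crack_sec), ]
```

Placing the two summaries side by side makes it easy to spot categories whose ordering by cracking time differs from their ordering by strength.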
To answer the question of which password type is the strongest among these passwords, the ‘rebellious-rude’ and ‘cool-macho’ types are good contenders.
Through the exploratory data analysis of the dataset on the top 500 commonly used passwords, it was observed that most people tend to choose passwords that can be easily remembered. A typical choice is therefore a simple password that is related to a name or consists of a simple alphanumeric sequence and is roughly 6-7 characters long. On further exploration, it was found that 43.6% of the commonly used passwords are relatively high in strength and that around 3% of the passwords had a very high strength that differed greatly from typical passwords.
Furthermore, it was observed that, among the password categories, the ‘rebellious-rude’ and ‘cool-macho’ types are relatively strong and take more time to crack. Another striking discovery is that cracking time and strength are not strictly related in this dataset. Not all passwords with high strength take long to crack, and not all passwords with low strength are cracked easily: there are instances of a high-strength password being cracked more quickly than a low-strength one.
It can be concluded that most people choose common passwords that can be easily hacked and that using any of the passwords in the dataset is not recommended.
However, by reviewing the original report, two additional questions have been raised to improve the existing report. They are:
Further analysis of the relationship between the online and offline cracking times of each password will be carried out in order to understand the underlying factors that might be affecting them.
In addition, the relationship between the types of character used and the strength of passwords will be analysed, for example whether a password is all digits, all letters, mixed uppercase and lowercase, or a mix of digits and letters. This may give a clearer sense of why passwords are strong or weak.
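One possible way to derive such a character-type variable is sketched below in base R. The function name `classify_chars()`, the class labels and the example inputs ‘PassWord’, ‘abc123’ and ‘p@ss’ are all hypothetical, and the grouping rules are an assumption rather than a definition taken from the report.

```r
# Hypothetical helper: label each password by the kinds of character it uses.
# POSIX character classes are used so the result does not depend on locale.
classify_chars <- function(p) {
  all_digits  <- grepl("^[[:digit:]]+$", p)
  all_letters <- grepl("^[[:alpha:]]+$", p)
  mixed_case  <- grepl("[[:lower:]]", p) & grepl("[[:upper:]]", p)
  alnum_only  <- grepl("^[[:alnum:]]+$", p)

  ifelse(all_digits, "digits only",
  ifelse(all_letters & mixed_case, "mixed-case letters",
  ifelse(all_letters, "letters only",
  ifelse(alnum_only, "letters and digits", "other"))))
}

# Example inputs; only '1234' and 'password' are quoted in the report.
examples <- c("1234", "password", "PassWord", "abc123", "p@ss")
char_type <- classify_chars(examples)
data.frame(password = examples, char_type = char_type)
```

The resulting label could then be compared against the strength variable in the same way as the existing category variable.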
[1] Winder, D. (2020). New Dark Web Audit Reveals 15 Billion Stolen Logins From 100,000 Breaches. Retrieved 18 August 2020, from https://www.forbes.com/sites/daveywinder/2020/07/08/new-dark-web-audit-reveals-15-billion-stolen-logins-from-100000-breaches-passwords-hackers-cybercrime/#344d620180fb
[2] Mock, T. (2020). rfordatascience/tidytuesday. Retrieved 16 August 2020, from https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-14
[3] McCandless, D. (2020). Knowledge is Beautiful, my new book - Information is Beautiful. Retrieved 18 August 2020, from http://www.informationisbeautiful.net/2014/knowledge-is-beautiful/
[4] Wood, R. (2020). Pipal, Password Analyser - DigiNinja. Retrieved 18 August 2020, from https://digi.ninja/projects/pipal.php
[5] Passwords - SkullSecurity. (2020). Retrieved 20 August 2020, from https://wiki.skullsecurity.org/Passwords
[6] Pie Charts. (2020). Retrieved 24 August 2020, from https://plotly.com/r/pie-charts/
[7] Elegant Visualization of Density Distribution in R Using Ridgeline - Datanovia. (2020). Retrieved 23 August 2020, from https://www.datanovia.com/en/blog/elegant-visualization-of-density-distribution-in-r-using-ridgeline/
[8] Claus O. Wilke (2020). ggridges: Ridgeline Plots in ‘ggplot2’. R package version 0.5.2. https://CRAN.R-project.org/package=ggridges
[9] Yihui Xie, Joe Cheng and Xianying Tan (2020). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.15. https://CRAN.R-project.org/package=DT
[10] Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
[11] Tierney N (2017). “visdat: Visualising Whole Data Frames.” JOSS, 2(16), 355. https://doi.org/10.21105/joss.00355
[12] Nicholas Tierney, Di Cook, Miles McBain and Colin Fay (2020). naniar: Data Structures, Summaries, and Visualisations for Missing Data. R package version 0.5.2. https://CRAN.R-project.org/package=naniar
[13] Hadley Wickham, Jim Hester and Romain Francois (2018). readr: Read Rectangular Text Data. R package version 1.3.1. https://CRAN.R-project.org/package=readr
[14] Hao Zhu (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra
[15] Joe Cheng, Carson Sievert, Winston Chang, Yihui Xie and Jeff Allen (2020). htmltools: Tools for HTML. R package version 0.5.0. https://CRAN.R-project.org/package=htmltools
[16] Ian Fellows (2018). wordcloud: Word Clouds. R package version 2.6. https://CRAN.R-project.org/package=wordcloud
[17] Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes
[18] C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020.
[19] Joe Cheng (2020). crosstalk: Inter-Widget Interactivity for HTML Widgets. R package version 1.1.0.1. https://CRAN.R-project.org/package=crosstalk
[20] Garrick Aden-Buie (2020). ggpomological: Pomological plot themes for ggplot2. R package version 0.1.2. https://github.com/gadenbuie/ggpomological
[21] Hadley Wickham and Dana Seidel (2020). scales: Scale Functions for Visualization. R package version 1.1.1. https://CRAN.R-project.org/package=scales
[22] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.